Deadline: July 23rd, 23:00
Academic Integrity
This project is individual - it is to be completed on your own. If you have questions, please post your query in the APS1070 Piazza Q&A forums (the answer might be useful to others!).
Do not share your code with others, or post your work online. Do not submit code that you have not written yourself. Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.
Please fill out the following:
Download your notebook: File -> Download .ipynb
Click on the Files icon on the far left menu of Colab
Select & upload the .ipynb file you just downloaded, then obtain its path (right click) (you might need to hit the Refresh button before your file shows up)
Execute the following in a Colab cell:
%%shell
jupyter nbconvert --to html /PATH/TO/YOUR/NOTEBOOKFILE.ipynb
An HTML version of your notebook will appear in the files, so you can download it.
Submit both `HTML` and `IPYNB` files on Quercus for grading.
In this project we work with two datasets: a Covid-19 dataset that reports the number of confirmed cases per day, and Fashion-MNIST.
!jupyter nbconvert --to html /content/Jashwant_Raj_Gunaseelan_S22_APS1070_Project_3.ipynb
import pandas as pd

cases_raw = pd.read_csv(
    filepath_or_buffer='https://raw.githubusercontent.com/aps1070-2019/datasets/master/APS_COVID_Jan22.csv',
    index_col=0,
    thousands=','
)
Write a function to do the following: [0.25]
Apply StandardScaler to the data. Each day should have a mean of zero and a standard deviation of 1. [0.25]
Repeat step 1 on the standardized dataset for the US, China, and Canada. [0.25]
For the US, Canada, and China: what does it mean if the curve goes up or down (are the number of Covid cases negative)? What does the sign of the values indicate? [0.25]
#Part 1 Question 1
from IPython.core.pylabtools import figsize
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

def Plotting(Countries, Dates):
    fig, axs = plt.subplots(6, figsize=(35, 65))
    for j in range(6):
        plot_ = sns.lineplot(ax=axs[j], x=Dates, y=Scaled_Data.loc[Countries[j]].values)
        axs[j].set_title(Countries[j])
        for ind, label in enumerate(plot_.get_xticklabels()):
            # keep every 10th label to reduce x-axis clutter
            label.set_visible(ind % 10 == 0)
        plot_.set_xticklabels(Dates, rotation=30)
        plot_.set_xlabel("Dates")
    plt.show()
#Part 1 Question 2
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot as plt
scaler=StandardScaler()
Scaled_data=scaler.fit_transform(cases_raw)
index_values,column_values=cases_raw.index,cases_raw.columns
Scaled_Data=pd.DataFrame(data=Scaled_data,index=index_values.values,columns=column_values.values)
Scaled_Data
#Part 1 Question 3
Plotting(['US','Canada','China','India','Germany','Japan'],column_values)
Part 1 Question 4:
After standardization, the number of cases is on a common scale across countries, with the mean of each day acting as the 0 balance point. The case counts themselves are never negative; the sign of a standardized value tells us whether a country is above or below the cross-country average for that day.
When the curve goes up, the country's cases are above the mean for those days, so the values are positive. When the curve is below 0 (the mean), the country's cases are below the average for those days, so the values are negative.
US: Initially the cases were low and close to the mean, but as the days went on the cases increased drastically, and the curve does not appear to be returning toward the mean.
Canada: The cases start below the mean, which was good, but as in the US they then increase drastically, reduce gradually, and settle into an unsteady curve that stays fairly close to the mean.
China: The curve starts off very high, far from the mean, but gradually approaches it and then stays very close to the mean with little deviation, meaning the cases are not too high.
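As a quick sanity check of the standardization described above, a minimal sketch (on a small synthetic array, not the Covid data) confirms that StandardScaler gives every column mean 0 and standard deviation 1, so a positive value simply means "above that column's average":

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: 3 "countries" x 4 "days" (hypothetical values, not the real dataset)
X = np.array([[1.0, 10.0, 100.0, 5.0],
              [2.0, 20.0, 200.0, 6.0],
              [3.0, 30.0, 300.0, 7.0]])
Z = StandardScaler().fit_transform(X)

# Each column (day) now has mean 0 and population std 1, so a positive entry
# means "above the cross-country average for that day" -- not a negative case count.
print(np.allclose(Z.mean(axis=0), 0.0))  # True
print(np.allclose(Z.std(axis=0), 1.0))   # True
```

Note that StandardScaler uses the population standard deviation (ddof=0), which is why `Z.std(axis=0)` with NumPy's default matches exactly.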
Write a function get_sorted_eigen(df_cov) that gets the covariance matrix of dataframe df (from step 1) and returns sorted eigenvalues and eigenvectors using np.linalg.eigh. [0.25]
Plot a scree plot (bars for each component and a line to show the cumulative explained variance, similar to the tutorial; limit the x-axis if needed to better see the plot). [0.25]
### YOUR CODE HERE ###
import numpy as np

S_Data = Scaled_Data.values
n, m = S_Data.shape
CovarMat = np.dot(S_Data.T, S_Data) / (n - 1)
print(CovarMat.shape)

def get_sorted_eigen(df_cov):
    # np.linalg.eigh returns eigenvalues in ascending order; reverse for descending
    Eigenvalues, Eigenvectors = np.linalg.eigh(df_cov)
    Eigenvalues = Eigenvalues.tolist()
    Eigenvalues.reverse()
    Eigenvectors = Eigenvectors[:, ::-1]
    return Eigenvalues, Eigenvectors

EigValues, EigVector = get_sorted_eigen(CovarMat)

eigValSum = sum(EigValues)
expVar = [eigV / eigValSum * 100 for eigV in EigValues]
cumExpVar = np.cumsum(expVar)

plt.figure(figsize=(30, 5))
plt.bar(range(len(expVar[:170])), expVar[:170], label='Explained Variance')
plt.plot(cumExpVar[:170], 'r-o', label='Cumulative Explained Variance')
plt.legend()
plt.title('Scree Plot')
plt.show()

# First index where the cumulative explained variance reaches 98%, plus 1
print(f"The number of PCs required to cover 98% of the variance is {np.argmax(cumExpVar >= 98) + 1}")
def Plotting2(EigVector, Dates):
    fig, axs = plt.subplots(8, 2, figsize=(35, 85))
    plots = 0
    for i in range(2):
        for j in range(8):
            plot_ = sns.lineplot(ax=axs[j][i], x=Dates, y=EigVector[:, plots])
            axs[j][i].set_title('PC' + str(plots + 1))
            for ind, label in enumerate(plot_.get_xticklabels()):
                # keep every 10th label to reduce x-axis clutter
                label.set_visible(ind % 10 == 0)
            plot_.set_xticklabels(Dates, rotation=60)
            plot_.set_xlabel("Dates")
            plots = plots + 1
    plt.show()

Plotting2(EigVector, list(column_values.values)[::-1])
As we see from the scree plot, only a handful of principal components out of the total are needed to capture 98% of the variance: the first 3 PCs with high eigenvalues. PC1 captures the most variance, and the variance keeps decreasing for subsequent PCs. In the plot above, the curves for the leading PCs are smooth, and as the PC index increases the curves become less smooth and noisier.
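The "number of PCs for a given variance threshold" reported above can be computed directly from any explained-variance vector. A minimal sketch with made-up percentages (not the Covid values):

```python
import numpy as np

# Hypothetical explained-variance percentages for 6 components
expVar = np.array([60.0, 25.0, 10.0, 3.0, 1.5, 0.5])
cumExpVar = np.cumsum(expVar)  # [60, 85, 95, 98, 99.5, 100]

def components_for(threshold):
    """Smallest k whose cumulative explained variance reaches the threshold (%)."""
    # np.argmax returns the first index where the condition is True
    return int(np.argmax(cumExpVar >= threshold)) + 1

print(components_for(98))  # -> 4  (60 + 25 + 10 + 3 = 98)
```

The same one-liner applied to the real `cumExpVar` gives the count printed in the cell above.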
Create a function that:
Plots 4 figures:
The incremental reconstruction of the original (not standardized) time-series for the specified country in a single plot. [1]
You should show at least 5 curves in the incremental-reconstruction figure. For example, you can pick the following (or any other combination that you think is reasonable):
Hint: you need to compute the reconstruction for the standardized time-series first, and then scale it back to the original (non-standardized) form using StandardScaler's inverse_transform help...
The residual error (df - df_reconstructed). On the x-axis, you have dates, and on the y-axis, the residual error. Test your function using the US, Canada, and China as inputs. [0.5]
from sklearn.metrics import mean_squared_error

def plot_country_figures(original_df, country_name):
    ### Finding the eigenvalues and eigenvectors ###
    scaler = StandardScaler()
    Scaled_data = scaler.fit_transform(original_df)
    row_values = original_df.index.to_numpy()
    indexx = np.where(row_values == country_name)
    n, m = Scaled_data.shape
    CovarMat = np.dot(Scaled_data.T, Scaled_data) / (n - 1)
    EigValues, EigVector = get_sorted_eigen(CovarMat)
    RMSE = []

    # Original time series
    fig, axs = plt.subplots(2, 2, figsize=(65, 45))
    plot_ = sns.lineplot(ax=axs[0, 0], x=original_df.columns.values, y=original_df.loc[country_name])
    axs[0, 0].set_title(country_name, fontsize=20)
    axs[0, 0].grid(visible=True, which='both', linestyle='--')
    for ind, label in enumerate(plot_.get_xticklabels()):
        # keep every 10th label to reduce x-axis clutter
        label.set_visible(ind % 10 == 0)
    plot_.set_xticklabels(original_df.columns.values, rotation=60, fontsize=20)
    plot_.set_xlabel("Dates", fontsize=20)

    # Incremental reconstruction with an increasing number of PCs
    for PC in [1, 2, 4, 8, 16]:
        W = EigVector[:, 0:PC]
        projection = np.dot(Scaled_data, W)
        ReconX = np.dot(projection, W.T)                   # reconstruction in standardized space
        Recon_original = scaler.inverse_transform(ReconX)  # back to the original scale
        RMSE.append(mean_squared_error(original_df.values[indexx][0], Recon_original[indexx][0], squared=False))
        plot_ = sns.lineplot(ax=axs[0, 1], x=original_df.columns.values, y=Recon_original[indexx][0], linewidth=5)
    plot_.legend(labels=["PC=1", "PC=2", "PC=4", "PC=8", "PC=16"], fontsize=20)
    plot_.set_xticklabels(original_df.columns.values, rotation=60, fontsize=20)
    axs[0, 1].set_title('PC values', fontsize=20)
    axs[0, 1].grid(visible=True, which='both', linestyle='--')
    for ind, label in enumerate(plot_.get_xticklabels()):
        label.set_visible(ind % 10 == 0)
    plot_.set_xlabel("Dates", fontsize=20)

    # Residual error with the best reconstruction
    PC = 16
    W = EigVector[:, 0:PC]
    projection = np.dot(Scaled_data, W)
    ReconX = np.dot(projection, W.T)
    Recon_original = scaler.inverse_transform(ReconX)
    Residual_Error = original_df.values - Recon_original
    plot_ = sns.lineplot(ax=axs[1, 0], x=original_df.columns.values, y=Residual_Error[indexx][0], linewidth=5)
    plot_.legend(labels=["PC=16"], fontsize=20)
    plot_.set_xticklabels(original_df.columns.values, rotation=60, fontsize=20)
    axs[1, 0].set_title('Residual Error', fontsize=20)
    axs[1, 0].grid(visible=True, which='both', linestyle='--')
    for ind, label in enumerate(plot_.get_xticklabels()):
        label.set_visible(ind % 10 == 0)
    plot_.set_xlabel("Dates", fontsize=20)

    # RMSE vs number of principal components
    plot_ = sns.lineplot(ax=axs[1, 1], x=[1, 2, 4, 8, 16], y=RMSE)
    plot_.legend(labels=["RMSE"], fontsize=20)
    axs[1, 1].set_title('RMSE', fontsize=20)
    axs[1, 1].grid(visible=True, which='both', linestyle='--')
    plot_.set_xlabel("Principal Components", fontsize=20)
    plot_.set_ylabel("RMSE", fontsize=20)
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=20)
    plt.show()
plot_country_figures(cases_raw,'US')
plot_country_figures(cases_raw,'China')
plot_country_figures(cases_raw,'Canada')
plot_country_figures(cases_raw,'India')
Modify your function in part 3 to use SVD instead of PCA for extracting the eigenvectors. [0.5]
Explain if standardization or covariance computation is required for this part. [0.5]
Repeat part 3 and compare your PCA and SVD results. [1]
### YOUR CODE HERE ###
def SVD(original_df, country_name):
    ### Finding the singular values and singular vectors ###
    row_values = original_df.index.to_numpy()
    indexx = np.where(row_values == country_name)
    # SVD of the transposed raw data (dates x countries)
    U, S, V = np.linalg.svd(original_df.values.T)
    RMSE = []

    # Original time series
    fig, axs = plt.subplots(2, 2, figsize=(65, 45))
    plot_ = sns.lineplot(ax=axs[0, 0], x=original_df.columns.values, y=original_df.loc[country_name])
    axs[0, 0].set_title(country_name, fontsize=20)
    axs[0, 0].grid(visible=True, which='both', linestyle='--')
    for ind, label in enumerate(plot_.get_xticklabels()):
        # keep every 10th label to reduce x-axis clutter
        label.set_visible(ind % 10 == 0)
    plot_.set_xticklabels(original_df.columns.values, rotation=60, fontsize=20)
    plot_.set_xlabel("Dates", fontsize=20)

    # Reconstruction with an increasing number of singular values
    for Singular in [1, 2, 4, 8, 16]:
        Recon_original = (U[:, 0:Singular] * S[0:Singular]) @ V[0:Singular, :]
        RMSE.append(mean_squared_error(original_df.values[indexx][0], Recon_original.T[indexx][0], squared=False))
        plot_ = sns.lineplot(ax=axs[0, 1], x=original_df.columns.values, y=Recon_original.T[indexx][0], linewidth=5)
    plot_.legend(labels=["Singular value = 1", "Singular value = 2", "Singular value = 4", "Singular value = 8", "Singular value = 16"], fontsize=20)
    plot_.set_xticklabels(original_df.columns.values, rotation=60, fontsize=20)
    axs[0, 1].set_title('Singular values', fontsize=20)
    axs[0, 1].grid(visible=True, which='both', linestyle='--')
    for ind, label in enumerate(plot_.get_xticklabels()):
        label.set_visible(ind % 10 == 0)
    plot_.set_xlabel("Dates", fontsize=20)

    # Residual error with the best reconstruction
    Singular = 16
    Recon_original = (U[:, 0:Singular] * S[0:Singular]) @ V[0:Singular, :]
    Residual_Error = original_df.values - Recon_original.T
    plot_ = sns.lineplot(ax=axs[1, 0], x=original_df.columns.values, y=Residual_Error[indexx][0], linewidth=5)
    plot_.legend(labels=["Singular value = 16"], fontsize=20)
    plot_.set_xticklabels(original_df.columns.values, rotation=60, fontsize=20)
    axs[1, 0].set_title('Residual Error', fontsize=20)
    axs[1, 0].grid(visible=True, which='both', linestyle='--')
    for ind, label in enumerate(plot_.get_xticklabels()):
        label.set_visible(ind % 10 == 0)
    plot_.set_xlabel("Dates", fontsize=20)

    # RMSE vs number of singular values
    plot_ = sns.lineplot(ax=axs[1, 1], x=[1, 2, 4, 8, 16], y=RMSE)
    plot_.legend(labels=["RMSE"], fontsize=20)
    axs[1, 1].set_title('RMSE', fontsize=20)
    axs[1, 1].grid(visible=True, which='both', linestyle='--')
    plot_.set_xlabel("Singular values", fontsize=20)
    plot_.set_ylabel("RMSE", fontsize=20)
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=20)
    plt.show()
SVD(cases_raw,'US')
plot_country_figures(cases_raw,'US')
SVD(cases_raw,'China')
plot_country_figures(cases_raw,'China')
SVD(cases_raw,'Canada')
plot_country_figures(cases_raw,'Canada')
SVD(cases_raw,'India')
plot_country_figures(cases_raw,'India')
For SVD we don't standardize the data or compute the covariance matrix, since we never form eigenvalues or eigenvectors explicitly. Instead, we compute the left singular matrix (U), the right singular matrix (V), and the diagonal matrix S whose diagonal elements are the singular values.
When we compare the SVD and PCA reconstructions, the residual error for countries such as the US, Canada, and India is smaller with SVD than with PCA, while for countries such as China the residual error is larger with SVD than with PCA.
Based on these observations, reconstruction takes fewer singular values than principal components: what may require 4 PCs in PCA may take only 2 singular values in SVD.
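The two routes are closely related: on a centred (mean-subtracted) matrix, the right-singular vectors are exactly the covariance eigenvectors, and the squared singular values divided by (n - 1) are the eigenvalues. Since the SVD above was run on the raw, unstandardized values, its components can legitimately differ from the PCA ones, which is one source of the differing residuals. A small sketch on synthetic data (not the Covid dataset) illustrates the equivalence in the centred case:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))   # 50 samples, 6 features (synthetic)
Xc = X - X.mean(axis=0)        # centre the columns

# PCA route: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

# SVD route: no covariance matrix needed
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# The two agree: eigenvalues = S**2 / (n - 1), and the right-singular
# vectors equal the eigenvectors up to sign.
print(np.allclose(eigvals, S**2 / (Xc.shape[0] - 1)))
print(np.allclose(np.abs(Vt.T), np.abs(eigvecs)))
```

Both prints show agreement; the absolute values are compared because each eigenvector is only defined up to a sign flip.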
Fashion-MNIST is a dataset of clothing images. Each image in Fashion-MNIST has 28x28 pixels, which can be represented as an array with 784 elements. You can take a better look at this dataset in this link
In this part, we are going to use PCA to compress these images. The $x$ matrix below has 1000 images.
from sklearn.datasets import fetch_openml
import pandas as pd
mnist = fetch_openml("Fashion-MNIST")
x = mnist.data[0:1000]
y = mnist.target[0:1000]
target_encoding = {0: "T-shirt/top",
                   1: "Trouser",
                   2: "Pullover",
                   3: "Dress",
                   4: "Coat",
                   5: "Sandal",
                   6: "Shirt",
                   7: "Sneaker",
                   8: "Bag",
                   9: "Ankle boot"}
x.shape
import matplotlib.pyplot as plt
plt.gray()
ind = 15
plt.imshow(x.loc[ind].values.reshape(28,28))
print ("Label is:", target_encoding[int(y[ind])])
plt.show()
x['Class'] = y
x
#Creating new Dataframe
DF=x.loc[(x['Class']=='3') | (x['Class']=='4') | (x['Class']=='6')]
DF=DF.drop('Class', axis=1)
DF.values
DF.values.shape
x=x.drop("Class",axis=1)
DF
import numpy as np
import random as rd

# Finding the eigenvalues and eigenvectors via SVD
U, S, Vh = np.linalg.svd(DF.values)
# The eigenvalues of X.T @ X / (n - 1) are the squared singular values
# over (n - 1); the eigenvectors are the columns of Vh.T.
EigenValues = (S ** 2) / (DF.shape[0] - 1)
EigenVectors = Vh.T
import math

# Plotting the first 10 eigenvectors
COUNT = 10
ROWS = math.ceil(COUNT / 3)
fig = plt.figure(figsize=(12, ROWS * 4))
for i in range(COUNT):
    plt.subplot(ROWS, 3, i + 1)
    plt.imshow(EigenVectors[:, i].reshape(28, 28), cmap=plt.cm.gray)
    plt.title('#{}'.format(i + 1))
def plot_reconstruction(n):
    plt.figure(figsize=(10, 15))
    plt.subplot(1, 2, 1)
    plt.gray()
    # Pick a random image from the filtered dataframe
    ind = DF.index.values[rd.randint(0, len(DF.index.values) - 1)]
    plt.imshow(DF.loc[ind].values.reshape(28, 28))
    plt.title(f"Original Image, Label is: {target_encoding[int(y[ind])]}")

    # Reconstruction with the first n principal components
    plt.subplot(1, 2, 2)
    PC = n
    W = EigenVectors[:, 0:PC]
    projection = np.dot(DF.loc[ind].values.reshape(1, -1), W)
    ReconX = np.dot(projection, W.T)
    Image = ReconX.reshape(28, 28)
    plt.imshow(Image, cmap=plt.cm.gray)
    plt.title(f"Reconstructed Image with {PC} principal components")

for check in range(1, 785, 100):
    plot_reconstruction(check)
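The steady improvement in the reconstructions above reflects the Eckart-Young theorem: the rank-k truncation is the best rank-k approximation of a matrix, and its Frobenius error is the root of the sum of the squared discarded singular values, so the error can only shrink as components are added. A small sketch on synthetic data (not Fashion-MNIST):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 20))   # synthetic "images": 40 samples, 20 pixels
U, S, Vt = np.linalg.svd(X, full_matrices=False)

def recon_error(k):
    """Frobenius error of the rank-k SVD reconstruction."""
    Xk = (U[:, :k] * S[:k]) @ Vt[:k, :]
    return np.linalg.norm(X - Xk)

errors = [recon_error(k) for k in (1, 2, 5, 10, 20)]
# Error never increases with more components, and the rank-k error equals
# the root sum of squares of the discarded singular values (Eckart-Young).
print(all(a >= b for a, b in zip(errors, errors[1:])))
print(np.isclose(recon_error(5), np.sqrt(np.sum(S[5:] ** 2))))
```

With all 20 components the reconstruction is exact (up to floating-point noise), mirroring how 784 components reproduce a Fashion-MNIST image perfectly.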
Understanding PCA and SVD:
https://towardsdatascience.com/pca-and-svd-explained-with-numpy-5d13b0d2a4d8
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.8-Singular-Value-Decomposition/